Current mainstream object detection methods for large aerial images usually divide large images into patches and then exhaustively detect the objects of interest on all patches, no matter whether there exist objects or not. This paradigm, although effective, is inefficient because the detectors have to go through all patches, severely hindering the inference speed. This paper presents an Objectness Activation Network (OAN) to help detectors focus on fewer patches but achieve more efficient inference and more accurate results, enabling a simple and effective solution to object detection in large images. In brief, OAN is a light fully-convolutional network for judging whether each patch contains objects or not, which can be easily integrated into many object detectors and jointly trained with them end-to-end. We extensively evaluate our OAN with five advanced detectors. Using OAN, all five detectors acquire more than 30.0% speed-up on three large-scale aerial image datasets, meanwhile with consistent accuracy improvements. On extremely large Gaofen-2 images (29200$\times$27620 pixels), our OAN improves the detection speed by 70.5%. Moreover, we extend our OAN to driving-scene object detection and 4K video object detection, boosting the detection speed by 112.1% and 75.0%, respectively, without sacrificing the accuracy. Code is available at https://github.com/Ranchosky/OAN.
translated by 谷歌翻译
Prompt learning is one of the most effective and trending ways to adapt powerful vision-language foundation models like CLIP to downstream datasets by tuning learnable prompt vectors with very few samples. However, although prompt learning achieves excellent performance over in-domain data, it still faces the major challenge of generalizing to unseen classes and domains. Some existing prompt learning methods tackle this issue by adaptively generating different prompts for different tokens or domains but neglecting the ability of learned prompts to generalize to unseen domains. In this paper, we propose a novel prompt learning paradigm that directly generates domain invariant prompt generalizable to unseen domains, called MetaPrompt. Specifically, a dual-modality prompt tuning network is proposed to generate prompts for inputs from both image and text modalities. More importantly, we propose a meta-learning-based prompt tuning algorithm that explicitly constrains the prompt tuned on a specific domain or class also to achieve good performance on another domain or class. Extensive experiments on 11 datasets for base-to-new generalization and four datasets for domain generalization demonstrate that our method consistently and significantly outperforms existing methods.
translated by 谷歌翻译
This technical report briefly describes our JDExplore d-team's Vega v2 submission on the SuperGLUE leaderboard. SuperGLUE is more challenging than the widely used general language understanding evaluation (GLUE) benchmark, containing eight difficult language understanding tasks, including question answering, natural language inference, word sense disambiguation, coreference resolution, and reasoning. [Method] Instead of arbitrarily increasing the size of a pretrained language model (PLM), our aim is to 1) fully extract knowledge from the input pretraining data given a certain parameter budget, e.g., 6B, and 2) effectively transfer this knowledge to downstream tasks. To achieve goal 1), we propose self-evolution learning for PLMs to wisely predict the informative tokens that should be masked, and supervise the masked language modeling (MLM) process with rectified smooth labels. For goal 2), we leverage the prompt transfer technique to improve the low-resource tasks by transferring the knowledge from the foundation model and related downstream tasks to the target task. [Results] According to our submission record (Oct. 2022), with our optimized pretraining and fine-tuning strategies, our 6B Vega method achieved new state-of-the-art performance on 4/8 tasks, sitting atop the SuperGLUE leaderboard on Oct. 8, 2022, with an average score of 91.3.
translated by 谷歌翻译
How do we know when the predictions made by a classifier can be trusted? This is a fundamental problem that also has immense practical applicability, especially in safety-critical areas such as medicine and autonomous driving. The de facto approach of using the classifier's softmax outputs as a proxy for trustworthiness suffers from the over-confidence issue; while the most recent works incur problems such as additional retraining cost and accuracy versus trustworthiness trade-off. In this work, we argue that the trustworthiness of a classifier's prediction for a sample is highly associated with two factors: the sample's neighborhood information and the classifier's output. To combine the best of both worlds, we design a model-agnostic post-hoc approach NeighborAgg to leverage the two essential information via an adaptive neighborhood aggregation. Theoretically, we show that NeighborAgg is a generalized version of a one-hop graph convolutional network, inheriting the powerful modeling ability to capture the varying similarity between samples within each class. We also extend our approach to the closely related task of mislabel detection and provide a theoretical coverage guarantee to bound the false negative. Empirically, extensive experiments on image and tabular benchmarks verify our theory and suggest that NeighborAgg outperforms other methods, achieving state-of-the-art trustworthiness performance.
translated by 谷歌翻译
Bimanual activities like coffee stirring, which require coordination of dual arms, are common in daily life and intractable to learn by robots. Adopting reinforcement learning to learn these tasks is a promising topic since it enables the robot to explore how dual arms coordinate together to accomplish the same task. However, this field has two main challenges: coordination mechanism and long-horizon task decomposition. Therefore, we propose the Mixline method to learn sub-tasks separately via the online algorithm and then compose them together based on the generated data through the offline algorithm. We constructed a learning environment based on the GPU-accelerated Isaac Gym. In our work, the bimanual robot successfully learned to grasp, hold and lift the spoon and cup, insert them together and stir the coffee. The proposed method has the potential to be extended to other long-horizon bimanual tasks.
translated by 谷歌翻译
Among current anchor-based detectors, a positive anchor box will be intuitively assigned to the object that overlaps it the most. The assigned label to each anchor will directly determine the optimization direction of the corresponding prediction box, including the direction of box regression and category prediction. In our practice of crowded object detection, however, the results show that a positive anchor does not always regress toward the object that overlaps it the most when multiple objects overlap. We name it anchor drift. The anchor drift reflects that the anchor-object matching relation, which is determined by the degree of overlap between anchors and objects, is not always optimal. Conflicts between the fixed matching relation and learned experience in the past training process may cause ambiguous predictions and thus raise the false-positive rate. In this paper, a simple but efficient adaptive two-stage anchor assignment (TSAA) method is proposed. It utilizes the final prediction boxes rather than the fixed anchors to calculate the overlap degree with objects to determine which object to regress for each anchor. The participation of the prediction box makes the anchor-object assignment mechanism adaptive. Extensive experiments are conducted on three classic detectors RetinaNet, Faster-RCNN and YOLOv3 on CrowdHuman and COCO to evaluate the effectiveness of TSAA. The results show that TSAA can significantly improve the detectors' performance without additional computational costs or network structure changes.
translated by 谷歌翻译
共享连接和自动驾驶汽车(CAV)之间的信息从根本上改善了自动驾驶的协作对象检测的性能。但是,由于实际挑战,骑士仍然存在不确定性的对象检测,这将影响自动驾驶中的后来模块,例如计划和控制。因此,不确定性定量对于诸如CAV等安全至关重要系统至关重要。我们的工作是第一个估计协作对象检测的不确定性的工作。我们提出了一种新型的不确定性量化方法,称为Double-M量化,该方法通过直接建模到边界框的每个角落的多变量高斯分布来定制移动块引导(MBB)算法。我们的方法基于离线双M训练过程,通过一个推理通过了一个推理,同时捕获了认知的不确定性和差异不确定性。它可以与不同的协作对象检测器一起使用。通过对综合协作感知数据集进行的实验,我们表明,与最先进的不确定性量化方法相比,我们的双M方法在不确定性评分和3%的准确度上提高了4倍以上。我们的代码在https://coperception.github.io/double-m-quantification上公开。
translated by 谷歌翻译
由于(1)低资源语言的数据稀缺,(2)培训和清爽100+单语言模型的昂贵计算成本,培训和部署混合语音识别的变压器LMS以低资源语言重新排行第二通道是具有挑战性的。,以及(3)考虑流量稀疏的效率低下。在这项研究中,我们提出了一种新的方法,将多个低资源的区域分组在一起,并优化ASR中多语言变压器LMS的性能。我们的本地组多语言变压器LMS的表现优于传统的多语言LM,以及降低维护成本和运营费用。此外,对于部署单语模型的低资源但人口流量的地区是可行的,我们表明,对我们的语言环境组的多语言LMS进行微调可产生比基线单语LMS更好的单语LM候选者。
translated by 谷歌翻译
现有的最佳3D对象检测器通常依赖于多模式融合策略。但是,由于忽略了特定于模式的有用信息,因此从根本上限制了该设计,并最终阻碍了模型性能。为了解决这一局限性,在这项工作中,我们介绍了一种新型的模式相互作用策略,在该策略中,在整个过程中学习和维护单个单模式表示,以使其在物体检测过程中被利用其独特特征。为了实现这一建议的策略,我们设计了一个深层互动体系结构,其特征是多模式代表性交互编码器和多模式预测交互解码器。大规模Nuscenes数据集的实验表明,我们所提出的方法经常超过所有先前的艺术。至关重要的是,我们的方法在竞争激烈的Nuscenes对象检测排行榜上排名第一。
translated by 谷歌翻译
联合学习(FL)使移动设备能够在保留本地数据的同时协作学习共享的预测模型。但是,实际上在移动设备上部署FL存在两个主要的研究挑战:(i)频繁的无线梯度更新v.s.频谱资源有限,以及(ii)培训期间渴望的FL通信和本地计算V.S.电池约束的移动设备。为了应对这些挑战,在本文中,我们提出了一种新型的多位空天空计算(MAIRCOMP)方法,用于FL中本地模型更新的频谱有效聚合,并进一步介绍用于移动的能源有效的FL设计设备。具体而言,高精度数字调制方案是在MAIRCOMP中设计和合并的,允许移动设备同时在多访问通道中同时在所选位置上传模型更新。此外,我们理论上分析了FL算法的收敛性。在FL收敛分析的指导下,我们制定了联合传输概率和局部计算控制优化,旨在最大程度地减少FL移动设备的总体能源消耗(即迭代局部计算 +多轮通信)。广泛的仿真结果表明,我们提出的方案在频谱利用率,能源效率和学习准确性方面优于现有计划。
translated by 谷歌翻译